
For best experience open the pre-rendered .html file
You have two options, depending on how much you trust my code and how little time you want to dedicate to this review :)
Why you should NOT view this notebook on GitHub
Some charts/diagrams/features are not visible on GitHub.
This is standard, well-known behaviour.
If you choose to run this locally, there are some prerequisites:
- Python 3.9
- `pip install -r requirements.txt` before proceeding.

Imagine that you are a data scientist who was just hired by LendingClub. They want to fully automate their lending decisions, and they hired you to lead this project. Your team consists of a product manager to help you understand the business domain and a software engineer who will help you integrate your solution into their product. During the initial investigation, you found that there was a similar initiative in the past and, luckily for you, it left behind a somewhat clean dataset of LendingClub's loan data.
In the first meeting with your team, you all decided to use this dataset because it will allow you to skip months of work building a dataset from scratch. In addition, you decided to tackle this problem iteratively, so that you can test your hypothesis that these decisions can be automated and get actual feedback from users as soon as possible.
For that, you have proposed a three-step plan on how to approach this problem:
Your team likes the plan, especially because after every step you'll have a fully working, deployed model that your company can use. Excitedly, you get to work!
import ipywidgets as widgets
from IPython.display import display, Markdown, Image, clear_output, HTML
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
import missingno as msno
import pygwalker as pyg
import pygwalker_utils.config as pyg_conf
import statsmodels.api as sm
from random import random, seed
import sqlite3 as lite
import logging
import warnings
import ydata_profiling
import iplantuml
import xml.dom.minidom
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    recall_score,
    precision_score,
)
from sklearn.preprocessing import MinMaxScaler, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree
from sklearn.metrics import make_scorer, confusion_matrix, PrecisionRecallDisplay
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.base import clone as clone_pipeline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
import shap
from imblearn.under_sampling import RandomUnderSampler
import joblib
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split as dask_train_test_split
import os
from os import path
from utils import *
from utils import __
from autofeat import AutoFeatRegressor as AFR
from sklearn.model_selection import train_test_split as sklearn_train_test_split
from sklearn.exceptions import DataConversionWarning
import category_encoders as ce
seed(100)
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 50
util.check("done")
✅
%reload_ext loans_clean
%reload_ext loans_utils
import loans_clean
from loans_utils import *
Let's use black to auto-format all our cells so they adhere to PEP8
import lab_black
%reload_ext lab_black
util.patch_nb_black()
# fmt: off
# fmt: on
from sklearn import set_config
set_config(transform_output="pandas")
logger = util.configure_logging(jupyterlab_level=logging.WARN, file_level=logging.DEBUG)
warnings.filterwarnings("ignore", category=FutureWarning)
# import warnings
# warnings.filterwarnings('error', category=pd.errors.DtypeWarning)
def ding(title="Ding!", message="Task completed"):
    """
    This method only works on Linux.
    """
    for i in range(2):
        !notify-send '{title}' '{message}'
Let's also create a simple feature toggle that we can use to skip expensive operations during notebook work (to save myself some time!)
Set it to true if you want to run absolutely everything. Set to false to skip optional steps/exploratory work.
def run_entire_notebook():
    run_all = False
    if not run_all:
        print("skipping optional operation")
    return run_all


def perform_split_once_off():
    split_files = False
    if not split_files:
        print("skipping once-off splitting operation")
    return split_files
The dataset from Turing College does not include a link to the Kaggle page for the dataset. I initially thought that was because it's a custom, self-hosted dataset that was not available on Kaggle.
The thing is, it would be really nice to find it on Kaggle, which normally hosts valuable information about a dataset: provenance, insights, data dictionary, etc.
A quick Google search suggested the dataset might be the one hosted here: https://www.kaggle.com/datasets/wordsforthewise/lending-club
We downloaded both and compared the file contents to make sure they are in fact the same.

Now that we know for sure that this kaggle dataset is the same one that was given to us by Turing, we're ready to start!
We're ready to start! Let's download the dataset from Kaggle.
dataset_name = "wordsforthewise/lending-club"
db_filename = "accepted_raw.csv"
auto_kaggle.download_dataset(dataset_name, db_filename, timeout_seconds=120)
__
File [dataset/accepted_raw.csv] already exists locally! No need to re-download dataset [wordsforthewise/lending-club]
This dataset has an unusual structure. There seem to be 4 items.
We could assume that the uncompressed elements are the only needed ones, and that the ZIP files are redundant... but they were all inside the main download, so we want to check them to see:

Are they the same?
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv rejected_2007_to_2018q4.csv original_rejected.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv accepted_2007_to_2018q4.csv original_accepted.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv accepted_2007_to_2018Q4.csv.gz zip_accepted.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv rejected_2007_to_2018Q4.csv.gz zip_rejected.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ gzip --decompress zip_accepted.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ gzip --decompress zip_rejected.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ sha256sum zip_accepted.csv original_accepted.csv/accepted_2007_to_2018Q4.csv
3eae03c28fd9d2e8a076ebeb73507e8d4d0f44d90500decdb0936e0933d1f36a  zip_accepted.csv
3eae03c28fd9d2e8a076ebeb73507e8d4d0f44d90500decdb0936e0933d1f36a  original_accepted.csv/accepted_2007_to_2018Q4.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ sha256sum zip_rejected.csv original_rejected.csv/rejected_2007_to_2018Q4.csv
07eb8468d55340d8ca4145c3e3c2e2d3e25ff83c44e432a825729ee6c99c4d45  zip_rejected.csv
07eb8468d55340d8ca4145c3e3c2e2d3e25ff83c44e432a825729ee6c99c4d45  original_rejected.csv/rejected_2007_to_2018Q4.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$
Turns out, they are the exact same files ✅
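The same check can be scripted in Python with `hashlib` (a minimal sketch; the commented-out paths are the ones from the transcript above and are only assumed to exist):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large CSVs don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# hypothetical usage, matching the shell transcript above:
# assert sha256_of("zip_accepted.csv") == sha256_of("original_accepted.csv/accepted_2007_to_2018Q4.csv")
```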
The downloaded dataset has duplicated files, and we can just keep one copy of each (1 x accepted + 1 x rejected), and drop the rest.
if run_entire_notebook():
    accepted_raw = "dataset/accepted_raw.csv"
    rejected_raw = "dataset/rejected_raw.csv"
    if path.exists("dataset/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv"):
        !mv dataset/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv $accepted_raw
    if path.exists("dataset/rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv"):
        !mv dataset/rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv $rejected_raw
    folders_to_delete = [
        "dataset/accepted_2007_to_2018Q4.csv.gz",
        "dataset/rejected_2007_to_2018Q4.csv.gz",
        "dataset/accepted_2007_to_2018q4.csv",
        "dataset/rejected_2007_to_2018q4.csv",
    ]
    for folder in folders_to_delete:
        if path.exists(folder):
            print(f"deleting empty folder [{folder}]")
            # use the loop variable; `rm -d` also removes empty directories
            !rm -d "$folder"
skipping optional operation
Trying to load the entire file into pandas crashes the Jupyter kernel, so we will try some other options for our initial inspection.
It does, however, take a serious toll on our system's memory and swap:

On the positive side, it's good to know that it actually works in raw IPython:
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset/accepted_2007_to_2018q4.csv$ ipython
Python 3.9.13 (main, Aug 25 2022, 23:26:10)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: pd.read_csv("accepted_2007_to_2018Q4.csv", index_col=0)
<ipython-input-2-c093d3440acc>:1: DtypeWarning: Columns (0,19,49,59,118,129,130,131,134,135,136,139,145,146,147) have mixed types. Specify dtype option on import or set low_memory=False.
pd.read_csv("accepted_2007_to_2018Q4.csv", index_col=0)
Out[2]:
member_id loan_amnt funded_amnt funded_amnt_inv ... settlement_date settlement_amount settlement_percentage settlement_term
id ...
68407277 NaN 3600.0 3600.0 3600.0 ... NaN NaN NaN NaN
68355089 NaN 24700.0 24700.0 24700.0 ... NaN NaN NaN NaN
68341763 NaN 20000.0 20000.0 20000.0 ... NaN NaN NaN NaN
66310712 NaN 35000.0 35000.0 35000.0 ... NaN NaN NaN NaN
68476807 NaN 10400.0 10400.0 10400.0 ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
88985880 NaN 40000.0 40000.0 40000.0 ... NaN NaN NaN NaN
88224441 NaN 24000.0 24000.0 24000.0 ... Mar-2019 10000.0 44.82 1.0
88215728 NaN 14000.0 14000.0 14000.0 ... NaN NaN NaN NaN
Total amount funded in policy code 1: 1465324575 NaN NaN NaN NaN ... NaN NaN NaN NaN
Total amount funded in policy code 2: 521953170 NaN NaN NaN NaN ... NaN NaN NaN NaN
[2260701 rows x 150 columns]
In [3]: _
But let's try to keep our analysis in Jupyter, just for the sake of the reading experience.
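To keep memory usage bounded inside Jupyter, one option is pandas' chunked reading, which streams the CSV in fixed-size batches instead of materializing all ~2.26M rows at once. A minimal sketch (the helper name and chunk size are our own choices):

```python
import pandas as pd


def count_rows_in_chunks(csv_path: str, chunksize: int = 100_000) -> int:
    """Iterate over the CSV in batches; each chunk is a regular DataFrame."""
    total = 0
    with pd.read_csv(csv_path, chunksize=chunksize, low_memory=False) as reader:
        for chunk in reader:
            total += len(chunk)  # any per-chunk aggregation would go here
    return total
```

The same pattern works for per-chunk filtering or aggregation, and Dask (imported earlier) offers a similar lazy, partitioned alternative.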
While we're in the terminal, let's also take a look at the last lines of the file.
if run_entire_notebook():
!tail -n 3 "$accepted_raw"
skipping optional operation
A few things we noticed (so far):
💡 We will drop any lines that have too many/few commas, caused by unescaped text (comments, etc...)
We will output a random shuffle of the file contents to a separate file, to take a quick look at the data, without crashing our kernel.
NOTE: These are *nix/Linux commands. The next cell will only work on *nix-compatible systems: it's unlikely to work on Windows, and it might work on macOS (with additional packages installed), but it's not tested.
They create a new sampled file with 100k rows from each source file. These files will be used for our initial exploration only! No model training will be done with this random data.
def drop_lines_with_wrong_commas(file_in: str, file_out: str, expected_commas: int):
    if path.exists(file_out):
        print(f"cleaned file already exists [{file_out}]. Skipping operation.")
        return
    print(f"dropping malformed rows from {file_in}: ", end="")
    rows_skipped = 0
    with open(file_in, "r") as inp, open(file_out, "w") as out:
        for row in inp:
            if row.count(",") == expected_commas:
                out.write(row)
            else:
                rows_skipped += 1
    print(f"rows dropped: {rows_skipped}")
if run_entire_notebook():
    acc_lines = !wc -l dataset/accepted_raw.csv
    rej_lines = !wc -l dataset/rejected_raw.csv
    print(acc_lines[0])
    print(rej_lines[0])
skipping optional operation
clean_accepted_raw = "dataset/clean_accepted_raw.csv"
clean_rejected_raw = "dataset/clean_rejected_raw.csv"
if run_entire_notebook():
    drop_lines_with_wrong_commas(accepted_raw, clean_accepted_raw, expected_commas=150)
    drop_lines_with_wrong_commas(rejected_raw, clean_rejected_raw, expected_commas=8)
    del accepted_raw
    del rejected_raw
skipping optional operation
Let's sample a subset of the cleaned files so we can get a general feel for the type of data inside.
sample_accepted = "dataset/sample_clean_accepted.csv"
sample_rejected = "dataset/sample_clean_rejected.csv"
if run_entire_notebook():
    assert clean_accepted_raw
    if not path.exists(sample_accepted):
        !head -n 1 "$clean_accepted_raw" > "$sample_accepted"
        !shuf -n 100000 "$clean_accepted_raw" >> "$sample_accepted"
        # since we are not including the `--repeat` flag, shuf will NOT output repeated lines.
    assert clean_rejected_raw
    if not path.exists(sample_rejected):
        !head -n 1 "$clean_rejected_raw" > "$sample_rejected"
        !shuf -n 100000 "$clean_rejected_raw" >> "$sample_rejected"
skipping optional operation
Just in the interest of visualizing the sizes we are dealing with, let's count the number of lines for each file:
if run_entire_notebook():
    for file in [
        clean_accepted_raw,
        clean_rejected_raw,
        sample_accepted,
        sample_rejected,
    ]:
        lines = !wc -l $file
        print(lines[0])
skipping optional operation
# delete the variables from the scope, not the files ;)
if run_entire_notebook():
    del clean_accepted_raw
    del clean_rejected_raw
skipping optional operation
At this stage, it would be helpful to find the data dictionary for this dataset.
The original source page is gone (HTTP 403 FORBIDDEN), so we will look for the next best thing: someone's backup of the dictionary.
We can readily find a copy of the files here: https://www.kaggle.com/datasets/jonchan2003/lending-club-data-dictionary
data_dictionary = "jonchan2003/lending-club-data-dictionary"
data_dictionary_check = "Lending Club Data Dictionary Approved.csv"
auto_kaggle.download_dataset(
data_dictionary,
data_dictionary_check,
subdir="dataset/dictionary",
timeout_seconds=120,
)
__
File [dataset/dictionary/Lending Club Data Dictionary Approved.csv] already exists locally! No need to re-download dataset [jonchan2003/lending-club-data-dictionary]
if run_entire_notebook():
    dict_files = [
        "Approved.csv",
        "Notes.csv",
        "Reject.csv",
    ]
    for file in dict_files:
        source = f"dataset/dictionary/Lending Club Data Dictionary {file}"
        target = f"dataset/dictionary/datadict_{file.lower()}"
        if not path.exists(target):
            print(f"converting to utf-8 and cleaning {file}")
            f = pd.read_csv(source, encoding="Windows-1252", index_col=0)
            f = f.dropna(axis=1, how="all")
            f.to_csv(target)
skipping optional operation
pd.read_csv("dataset/dictionary/datadict_approved.csv")
| LoanStatNew | Description | |
|---|---|---|
| 0 | acc_now_delinq | The number of accounts on which the borrower i... |
| 1 | acc_open_past_24mths | Number of trades opened in past 24 months. |
| 2 | addr_state | The state provided by the borrower in the loan... |
| 3 | all_util | Balance to credit limit on all trades |
| 4 | annual_inc | The self-reported annual income provided by th... |
| ... | ... | ... |
| 148 | settlement_amount | The loan amount that the borrower has agreed t... |
| 149 | settlement_percentage | The settlement amount as a percentage of the p... |
| 150 | settlement_term | The number of months that the borrower will be... |
| 151 | NaN | NaN |
| 152 | NaN | * Employer Title replaces Employer Name for al... |
153 rows × 2 columns
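With the dictionary loaded, a small helper makes it easy to look up a single column's description (a sketch; the CSV path matches the cell above, while the helper name is ours):

```python
import pandas as pd


def describe_column(datadict: pd.DataFrame, column: str) -> str:
    """Return the data-dictionary description for one dataset column."""
    match = datadict.loc[datadict["LoanStatNew"] == column, "Description"]
    return match.iloc[0] if not match.empty else f"no entry for [{column}]"


# hypothetical usage:
# datadict = pd.read_csv("dataset/dictionary/datadict_approved.csv")
# describe_column(datadict, "annual_inc")
```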
The sample datasets we created earlier contain 100k rows.
The idea is to use these for the general exploration to get a sense of the type of data in them.
Once we have a plan, we can use the files with all the data, so we get more data to train our models.
accepted_df_ = pd.read_csv(sample_accepted)
rejected_df_ = pd.read_csv(sample_rejected)
Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
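The DtypeWarning above can be avoided by pinning the dtype of the offending column when reading. A sketch on a tiny stand-in CSV (the real file's full dtype map would be an assumption, so only column 0, `id`, is pinned here):

```python
import io

import pandas as pd

# a tiny stand-in CSV whose `id` column mixes numbers and text,
# mimicking the summary footer rows of the real file
csv = io.StringIO("id,loan_amnt\n68407277,3600.0\nTotal amount funded: 123,\n")

# pinning the dtype up front avoids pandas' mixed-type inference warning
df = pd.read_csv(csv, dtype={"id": "string"}, low_memory=False)
```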
accepted_df_.head()
| id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55211652 | NaN | 13200.0 | 13200.0 | 13200.00 | 36 months | 11.53 | 435.48 | B | B5 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100629169 | NaN | 13500.0 | 13500.0 | 13500.00 | 36 months | 11.44 | 444.80 | B | B4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 132825772 | NaN | 13000.0 | 13000.0 | 13000.00 | 36 months | 11.98 | 431.67 | B | B5 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 70172881 | NaN | 15000.0 | 15000.0 | 15000.00 | 36 months | 22.45 | 576.36 | E | E5 | ... | NaN | NaN | DirectPay | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 169516 | NaN | 4000.0 | 4000.0 | 690.91 | 36 months | 14.70 | 138.08 | E | E5 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 151 columns
rejected_df_.head()
| Amount Requested | Application Date | Loan Title | Risk_Score | Debt-To-Income Ratio | Zip Code | State | Employment Length | Policy Code | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000.0 | 2018-06-01 | Other | NaN | 6% | 070xx | NJ | < 1 year | 0.0 |
| 1 | 10000.0 | 2018-09-12 | Other | NaN | 16.91% | 280xx | NC | < 1 year | 0.0 |
| 2 | 3000.0 | 2017-09-30 | Debt consolidation | 697.0 | 39.14% | 601xx | IL | < 1 year | 0.0 |
| 3 | 14500.0 | 2018-11-07 | Debt consolidation | NaN | 41.26% | 180xx | PA | < 1 year | 0.0 |
| 4 | 10000.0 | 2018-04-17 | Debt consolidation | NaN | 14.79% | 903xx | CA | < 1 year | 0.0 |
Contrary to what it may seem, the two files (accepted and rejected) are not about whether a borrower accepted a loan, but about which loan applications were accepted onto the platform (lendingclub.com).
We are only interested in the first one.
We will unload the unnecessary data from memory.
if run_entire_notebook():
    handy_utils.inspect_scope(globals())
skipping optional operation
if run_entire_notebook():
    del rejected_df_
    del sample_rejected
skipping optional operation
from IPython import get_ipython
if run_entire_notebook():
    get_ipython().run_line_magic("reset", "-f out")
skipping optional operation
if run_entire_notebook():
    handy_utils.inspect_scope(globals())
skipping optional operation
Nice... so much cleaner!
In this analysis, we will take a high-level overview of our data (from a random subset of the data).
The goal of this analysis is to get an understanding of what it looks like.
Once we are happy that we understand what the data looks like, we will split the entire dataset into train/val/test splits, we will train our models on the train split and validate/test on the other two.
Since the dataset is quite large, we might even try to use online training to improve our model with small batches, sequentially, instead of trying to train it all at once.
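The online-training idea mentioned above can be sketched with scikit-learn's `partial_fit`, which updates a model batch by batch; this is a toy illustration on synthetic data, not the actual modelling plan:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(random_state=42)  # a linear model trained with SGD

classes = np.array([0, 1])  # must be declared up front for partial_fit
for _ in range(10):  # pretend each iteration is one chunk read from disk
    X_batch = rng.normal(size=(200, 3))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # a learnable toy target
    model.partial_fit(X_batch, y_batch, classes=classes)

preds = model.predict(rng.normal(size=(5, 3)))
```

Each `partial_fit` call refines the same weights, so the full dataset never needs to sit in memory at once.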
accepted_exploratory_sample = accepted_df_.sample(10000)
ydata is an automated tool that we can use to get a general idea of our dataset, without requiring much code.
It is particularly useful when we have to automate data analysis covering very large datasets (millions of rows, hundreds of columns), like the one we are working on.
We will also cache the report to disk so that it is available for future runs (saving valuable time!)
@cached_with_pickle()
def ydata_report_data():
    report = ydata_utils.report(
        accepted_exploratory_sample, config_file="config/ydata_custom_config.yaml"
    )
    return report


@cached_profile_report()
def ydata_report():
    return ydata_report_data()
if run_entire_notebook():
    display(ydata_report())
skipping optional operation Loading from cache [./cached/profile_report/ydata_report.html]
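`cached_with_pickle` comes from the project's utils module, which isn't shown in this notebook; a decorator like it could be sketched as follows (names and cache location are assumptions):

```python
import functools
import os
import pickle


def cached_with_pickle(cache_dir: str = "./cached/pickle"):
    """Cache a zero-argument function's return value on disk as a pickle."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper():
            os.makedirs(cache_dir, exist_ok=True)
            cache_file = os.path.join(cache_dir, f"{fn.__name__}.pkl")
            if os.path.exists(cache_file):
                with open(cache_file, "rb") as f:
                    return pickle.load(f)  # cache hit: skip the expensive call
            result = fn()
            with open(cache_file, "wb") as f:
                pickle.dump(result, f)  # cache miss: compute once and persist
            return result

        return wrapper

    return decorator
```

On the second run the decorated function loads its result from disk instead of recomputing, which is exactly the time-saving behaviour we rely on above.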
Pygwalker is a Tableau-like tool that provides an easy interface to quickly generate lots of charts. Considering that we've spent the last 11 sprints using matplotlib and seaborn, I was looking to explore additional tools that simplify EDA and remove some of the burden.
The ideal tool would have:
Since we're not looking to create beautiful charts for presenting to stakeholders, and we only care about the insights, we will not be customizing anything in terms of theme. This is just for us ;)
Let's pick a small sample of the entire dataset to perform interactive EDA. The goal is not to get ALL of the insights, just the most obvious ones. We don't care if we miss something; we are trying to get to useful conclusions quickly and easily.
Since pygwalker struggles with large datasets, 2000 rows should be enough for now.
Before we start using PygWalker, let's configure its privacy settings so that no user data is "accidentally" sent out from our computer. Read more about it on the official page.
pyg_conf.set_config({"privacy": "offline"}, save=True)
We are ready to go!
We will load the results from a previously saved session, so that the notebook renders them all instead of starting from scratch.
if run_entire_notebook() or True:  # forced on, so the saved charts always render
    pyg.walk(
        accepted_exploratory_sample,
        dark="light",
        spec="config/pygwalker/config.json",
    )
skipping optional operation